The Thera bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the different fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Customers leaving the credit card service would lead the bank to a loss, so the bank wants to analyze its customer data, identify the customers who are likely to leave the credit card service, and understand the reasons why, so that the bank can improve in those areas.
We need to build a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
We need to identify the best possible model that will give the required performance.
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings('ignore')
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Libraries to split data, impute missing values
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
# Libraries to import decision tree classifier and different ensemble classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
# Libraries to tune models and get different metric scores
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint
# To be used for creating pipelines and customizing them
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.under_sampling import ClusterCentroids
#Loading the dataset
data=pd.read_csv("BankChurners.csv")
data.tail()
|   | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10122 | 772366833 | Existing Customer | 50 | M | 2 | Graduate | Single | $40K - $60K | Blue | 40 | ... | 2 | 3 | 4003.0 | 1851 | 2152.0 | 0.703 | 15476 | 117 | 0.857 | 0.462 |
| 10123 | 710638233 | Attrited Customer | 41 | M | 2 | NaN | Divorced | $40K - $60K | Blue | 25 | ... | 2 | 3 | 4277.0 | 2186 | 2091.0 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
| 10124 | 716506083 | Attrited Customer | 44 | F | 1 | High School | Married | Less than $40K | Blue | 36 | ... | 3 | 4 | 5409.0 | 0 | 5409.0 | 0.819 | 10291 | 60 | 0.818 | 0.000 |
| 10125 | 717406983 | Attrited Customer | 30 | M | 2 | Graduate | NaN | $40K - $60K | Blue | 36 | ... | 3 | 3 | 5281.0 | 0 | 5281.0 | 0.535 | 8395 | 62 | 0.722 | 0.000 |
| 10126 | 714337233 | Attrited Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Silver | 25 | ... | 2 | 4 | 10388.0 | 1961 | 8427.0 | 0.703 | 10294 | 61 | 0.649 | 0.189 |
5 rows × 21 columns
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CLIENTNUM 10127 non-null int64 1 Attrition_Flag 10127 non-null object 2 Customer_Age 10127 non-null int64 3 Gender 10127 non-null object 4 Dependent_count 10127 non-null int64 5 Education_Level 8608 non-null object 6 Marital_Status 9378 non-null object 7 Income_Category 10127 non-null object 8 Card_Category 10127 non-null object 9 Months_on_book 10127 non-null int64 10 Total_Relationship_Count 10127 non-null int64 11 Months_Inactive_12_mon 10127 non-null int64 12 Contacts_Count_12_mon 10127 non-null int64 13 Credit_Limit 10127 non-null float64 14 Total_Revolving_Bal 10127 non-null int64 15 Avg_Open_To_Buy 10127 non-null float64 16 Total_Amt_Chng_Q4_Q1 10127 non-null float64 17 Total_Trans_Amt 10127 non-null int64 18 Total_Trans_Ct 10127 non-null int64 19 Total_Ct_Chng_Q4_Q1 10127 non-null float64 20 Avg_Utilization_Ratio 10127 non-null float64 dtypes: float64(5), int64(10), object(6) memory usage: 1.6+ MB
Check the percentage of missing values in each column
pd.DataFrame(data={'% of Missing Values':round(data.isna().sum()/data.isna().count()*100,2)})
|   | % of Missing Values |
|---|---|
| CLIENTNUM | 0.0 |
| Attrition_Flag | 0.0 |
| Customer_Age | 0.0 |
| Gender | 0.0 |
| Dependent_count | 0.0 |
| Education_Level | 15.0 |
| Marital_Status | 7.4 |
| Income_Category | 0.0 |
| Card_Category | 0.0 |
| Months_on_book | 0.0 |
| Total_Relationship_Count | 0.0 |
| Months_Inactive_12_mon | 0.0 |
| Contacts_Count_12_mon | 0.0 |
| Credit_Limit | 0.0 |
| Total_Revolving_Bal | 0.0 |
| Avg_Open_To_Buy | 0.0 |
| Total_Amt_Chng_Q4_Q1 | 0.0 |
| Total_Trans_Amt | 0.0 |
| Total_Trans_Ct | 0.0 |
| Total_Ct_Chng_Q4_Q1 | 0.0 |
| Avg_Utilization_Ratio | 0.0 |
The Education_Level column has 15% missing values and the Marital_Status column has 7.4% missing values out of the total observations.

Let's check the number of unique values in each column.
data.nunique()
CLIENTNUM 10127 Attrition_Flag 2 Customer_Age 45 Gender 2 Dependent_count 6 Education_Level 6 Marital_Status 3 Income_Category 6 Card_Category 4 Months_on_book 44 Total_Relationship_Count 6 Months_Inactive_12_mon 7 Contacts_Count_12_mon 7 Credit_Limit 6205 Total_Revolving_Bal 1974 Avg_Open_To_Buy 6813 Total_Amt_Chng_Q4_Q1 1158 Total_Trans_Amt 5033 Total_Trans_Ct 126 Total_Ct_Chng_Q4_Q1 830 Avg_Utilization_Ratio 964 dtype: int64
#Dropping CLIENTNUM column
data.drop(columns='CLIENTNUM',inplace=True)
Summary of the data
data.describe().T
|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Customer_Age | 10127.0 | 46.325960 | 8.016814 | 26.0 | 41.000 | 46.000 | 52.000 | 73.000 |
| Dependent_count | 10127.0 | 2.346203 | 1.298908 | 0.0 | 1.000 | 2.000 | 3.000 | 5.000 |
| Months_on_book | 10127.0 | 35.928409 | 7.986416 | 13.0 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.0 | 3.812580 | 1.554408 | 1.0 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.0 | 2.341167 | 1.010622 | 0.0 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.0 | 2.455317 | 1.106225 | 0.0 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.0 | 8631.953698 | 9088.776650 | 1438.3 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.0 | 1162.814061 | 814.987335 | 0.0 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.0 | 7469.139637 | 9090.685324 | 3.0 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | 0.759941 | 0.219207 | 0.0 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.0 | 4404.086304 | 3397.129254 | 510.0 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.0 | 64.858695 | 23.472570 | 10.0 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | 0.712222 | 0.238086 | 0.0 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.0 | 0.274894 | 0.275691 | 0.0 | 0.023 | 0.176 | 0.503 | 0.999 |
Observations
- Customer_Age, Months_on_book, and Dependent_count appear to be normally distributed, with mean and median almost the same.
- Credit_Limit, Avg_Open_To_Buy, Total_Amt_Chng_Q4_Q1, Total_Trans_Amt, and Total_Ct_Chng_Q4_Q1 seem to have outliers, with max values far above the 75th percentile.
- The minimum Avg_Utilization_Ratio is 0, implying that some customers are not utilizing the credit card at all.

data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Attrition_Flag 10127 non-null object 1 Customer_Age 10127 non-null int64 2 Gender 10127 non-null object 3 Dependent_count 10127 non-null int64 4 Education_Level 8608 non-null object 5 Marital_Status 9378 non-null object 6 Income_Category 10127 non-null object 7 Card_Category 10127 non-null object 8 Months_on_book 10127 non-null int64 9 Total_Relationship_Count 10127 non-null int64 10 Months_Inactive_12_mon 10127 non-null int64 11 Contacts_Count_12_mon 10127 non-null int64 12 Credit_Limit 10127 non-null float64 13 Total_Revolving_Bal 10127 non-null int64 14 Avg_Open_To_Buy 10127 non-null float64 15 Total_Amt_Chng_Q4_Q1 10127 non-null float64 16 Total_Trans_Amt 10127 non-null int64 17 Total_Trans_Ct 10127 non-null int64 18 Total_Ct_Chng_Q4_Q1 10127 non-null float64 19 Avg_Utilization_Ratio 10127 non-null float64 dtypes: float64(5), int64(9), object(6) memory usage: 1.5+ MB
data.isnull().sum()
Attrition_Flag 0 Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 1519 Marital_Status 749 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
Let's check the count of each unique category in each of the categorical variables.
#Making a list of all categorical variables
cat_col=['Attrition_Flag', 'Gender','Education_Level', 'Marital_Status', 'Income_Category',
'Card_Category']
#Printing number of count of each unique value in each column
for column in cat_col:
print(data[column].value_counts())
print('-'*50)
Existing Customer 8500 Attrited Customer 1627 Name: Attrition_Flag, dtype: int64 -------------------------------------------------- F 5358 M 4769 Name: Gender, dtype: int64 -------------------------------------------------- Graduate 3128 High School 2013 Uneducated 1487 College 1013 Post-Graduate 516 Doctorate 451 Name: Education_Level, dtype: int64 -------------------------------------------------- Married 4687 Single 3943 Divorced 748 Name: Marital_Status, dtype: int64 -------------------------------------------------- Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 abc 1112 $120K + 727 Name: Income_Category, dtype: int64 -------------------------------------------------- Blue 9436 Silver 555 Gold 116 Platinum 20 Name: Card_Category, dtype: int64 --------------------------------------------------
Let's convert the abc values in Income_Category to NaN so we can impute these values later.

data['Income_Category'] = data['Income_Category'].replace("abc", np.nan)
Observations
- Blue is by far the most common card category.
- Income_Category contains an invalid value abc; we will handle this in missing value treatment later.

data.isnull().sum()
Attrition_Flag 0 Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 1519 Marital_Status 749 Income_Category 1112 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)  # histogram with the given bins
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # histogram with automatic bins
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# Plot the histogram and boxplot of independent variable Customer_Age
histogram_boxplot(data, "Customer_Age", bins=70)
# Plot the histogram and boxplot of independent variable Dependent_count
histogram_boxplot(data, "Dependent_count", bins=70)
# Plot the histogram and boxplot of independent variable Months_on_book
histogram_boxplot(data, "Months_on_book", bins=70)
# Plot the histogram and boxplot of independent variable Total_Relationship_Count
histogram_boxplot(data, "Total_Relationship_Count", bins=70)
# Plot the histogram and boxplot of independent variable Months_Inactive_12_mon
histogram_boxplot(data, "Months_Inactive_12_mon", bins=70)
# Plot the histogram and boxplot of independent variable Contacts_Count_12_mon
histogram_boxplot(data, "Contacts_Count_12_mon", bins=70)
# Plot the histogram and boxplot of independent variable Credit_Limit
histogram_boxplot(data, "Credit_Limit", bins=70)
# Plot the histogram and boxplot of independent variable Total_Revolving_Bal
histogram_boxplot(data, "Total_Revolving_Bal", bins=70)
# Plot the histogram and boxplot of independent variable Avg_Open_To_Buy
histogram_boxplot(data, "Avg_Open_To_Buy", bins=70)
# Plot the histogram and boxplot of independent variable Total_Amt_Chng_Q4_Q1
histogram_boxplot(data, "Total_Amt_Chng_Q4_Q1", bins=70)
# Plot the histogram and boxplot of independent variable Total_Trans_Amt
histogram_boxplot(data, "Total_Trans_Amt", bins=70)
# Plot the histogram and boxplot of independent variable Total_Trans_Ct
histogram_boxplot(data, "Total_Trans_Ct", bins=70)
# Plot the histogram and boxplot of independent variable Total_Ct_Chng_Q4_Q1
histogram_boxplot(data, "Total_Ct_Chng_Q4_Q1", bins=70)
# Plot the histogram and boxplot of independent variable Avg_Utilization_Ratio
histogram_boxplot(data, "Avg_Utilization_Ratio", bins=70)
# visualize outliers using boxplot
cols_num = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(20,30))
for i, variable in enumerate(cols_num):
plt.subplot(5,4,i+1)
plt.boxplot(data[variable],whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
Credit_Limit, Total_Ct_Chng_Q4_Q1, Avg_Open_To_Buy, and Total_Trans_Amt seem to have a lot of outliers.
At this point we will not treat them, since confirming these as true outliers needs business domain expertise. We will re-evaluate them if needed, based on how the models perform.
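Should model diagnostics later suggest these extremes are hurting performance, a standard option is to cap values at the same 1.5×IQR whiskers the boxplots use. A minimal sketch on a toy frame; `cap_outliers_iqr` is an illustrative helper, not part of this notebook:

```python
import pandas as pd

def cap_outliers_iqr(df, column, whis=1.5):
    """Cap values outside the Tukey fences (Q1 - whis*IQR, Q3 + whis*IQR)."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[column] = df[column].clip(lower=q1 - whis * iqr, upper=q3 + whis * iqr)
    return df

# Toy frame with one extreme Credit_Limit value
toy = pd.DataFrame({"Credit_Limit": [2555.0, 4549.0, 11067.0, 34516.0, 200000.0]})
capped = cap_outliers_iqr(toy.copy(), "Credit_Limit")
print(capped["Credit_Limit"].max())  # 79466.5, i.e. Q3 + 1.5 * IQR
```

Capping (rather than dropping rows) keeps the sample size intact, which matters here since attrited customers are already a minority class.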
Let's define a function to create barplots for the categorical variables, indicating the percentage of each category for that variable.
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# Plot the barplot of Attrition_Flag
labeled_barplot(data, "Attrition_Flag", perc=True)
# Plot the barplot of Gender
labeled_barplot(data, "Gender", perc=True)
# Plot the barplot of Education_Level
labeled_barplot(data, "Education_Level", perc=True)
# Plot the barplot of Marital_Status
labeled_barplot(data, "Marital_Status", perc=True)
# Plot the barplot of Income_Category
labeled_barplot(data, "Income_Category", perc=True)
# Plot the barplot of Card_Category
labeled_barplot(data, "Card_Category", perc=True)
# Plot the heatmap to check for correlations
plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, cmap="Spectral")
plt.show()
sns.pairplot(data=data,hue='Attrition_Flag')
<seaborn.axisgrid.PairGrid at 0x7fbf782f3100>
### Plot numerical variables against the dependent variable Attrition_Flag
cols = data[
[
"Customer_Age",
"Dependent_count",
"Months_on_book",
"Months_Inactive_12_mon",
"Contacts_Count_12_mon",
"Credit_Limit"
]
].columns.tolist()
plt.figure(figsize=(12, 7))
for i, variable in enumerate(cols):
plt.subplot(3, 2, i + 1)
    sns.boxplot(x=data["Attrition_Flag"], y=data[variable], palette="PuBu")
plt.tight_layout()
plt.title(variable)
plt.show()
### Plot numerical variables against the dependent variable Attrition_Flag
cols = data[
[
"Total_Revolving_Bal",
"Avg_Open_To_Buy",
"Total_Trans_Amt",
"Total_Ct_Chng_Q4_Q1",
"Total_Amt_Chng_Q4_Q1",
"Avg_Utilization_Ratio"
]
].columns.tolist()
plt.figure(figsize=(12, 7))
for i, variable in enumerate(cols):
plt.subplot(3, 2, i + 1)
    sns.boxplot(x=data["Attrition_Flag"], y=data[variable], palette="PuBu")
plt.tight_layout()
plt.title(variable)
plt.show()
### Plot numerical variables against the dependent variable Attrition_Flag
cols = data[
[
"Total_Trans_Ct"
]
].columns.tolist()
plt.figure(figsize=(12, 10))
for i, variable in enumerate(cols):
plt.subplot(3, 2, i + 1)
    sns.boxplot(x=data["Attrition_Flag"], y=data[variable], palette="PuBu")
plt.tight_layout()
plt.title(variable)
plt.show()
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
plt.show()
stacked_barplot(data, "Gender", "Attrition_Flag" )
Attrition_Flag Attrited Customer Existing Customer All Gender All 1627 8500 10127 F 930 4428 5358 M 697 4072 4769 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Education_Level", "Attrition_Flag" )
Attrition_Flag Attrited Customer Existing Customer All Education_Level All 1371 7237 8608 Graduate 487 2641 3128 High School 306 1707 2013 Uneducated 237 1250 1487 College 154 859 1013 Doctorate 95 356 451 Post-Graduate 92 424 516 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Marital_Status", "Attrition_Flag" )
Attrition_Flag Attrited Customer Existing Customer All Marital_Status All 1498 7880 9378 Married 709 3978 4687 Single 668 3275 3943 Divorced 121 627 748 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Income_Category", "Attrition_Flag" )
Attrition_Flag Attrited Customer Existing Customer All Income_Category All 1440 7575 9015 Less than $40K 612 2949 3561 $40K - $60K 271 1519 1790 $80K - $120K 242 1293 1535 $60K - $80K 189 1213 1402 $120K + 126 601 727 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Card_Category", "Attrition_Flag" )
Attrition_Flag Attrited Customer Existing Customer All Card_Category All 1627 8500 10127 Blue 1519 7917 9436 Silver 82 473 555 Gold 21 95 116 Platinum 5 15 20 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Contacts_Count_12_mon", "Attrition_Flag" )
Attrition_Flag Attrited Customer Existing Customer All Contacts_Count_12_mon All 1627 8500 10127 3 681 2699 3380 2 403 2824 3227 4 315 1077 1392 1 108 1391 1499 5 59 117 176 6 54 0 54 0 7 392 399 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Months_Inactive_12_mon", "Attrition_Flag" )
Attrition_Flag Attrited Customer Existing Customer All Months_Inactive_12_mon All 1627 8500 10127 3 826 3020 3846 2 505 2777 3282 4 130 305 435 1 100 2133 2233 5 32 146 178 6 19 105 124 0 15 14 29 ------------------------------------------------------------------------------------------------------------------------
data.isnull().sum()
Attrition_Flag 0 Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 1519 Marital_Status 749 Income_Category 1112 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
data['Attrition_Flag'].unique()
array(['Existing Customer', 'Attrited Customer'], dtype=object)
# Encode the Attrition_Flag column (Existing Customer -> 0, Attrited Customer -> 1) and convert it to integer type
data.replace({"Attrition_Flag": {'Existing Customer' : '0' , 'Attrited Customer' : '1'}}, inplace=True)
data['Attrition_Flag'] = data['Attrition_Flag'].astype('int64')
#Separating target variable and other variables
X=data.drop(columns='Attrition_Flag')
Y=data['Attrition_Flag']
# Splitting data into training, validation and test set:
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, Y, test_size=0.2, random_state=1, stratify=Y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_temp.shape , X_train.shape, X_val.shape, X_test.shape)
(8101, 19) (6075, 19) (2026, 19) (2026, 19)
Note: We will use the X_temp data as training data in models where we use cross validation, while X_train and X_val will be used in models where we do not use cross validation.
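Because `stratify` was passed to both splits, every subset should preserve the ~16% attrition rate of the full data. A self-contained sanity check of the same 60/20/20 scheme, using a toy target with the same minority share:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target with the same ~16% minority share as Attrition_Flag
y = pd.Series([1] * 160 + [0] * 840)
X = pd.DataFrame({"feature": np.arange(len(y))})

# Same scheme as above: hold out 20% for test, then 25% of the remainder for validation
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)

# Every split preserves the 16% minority proportion
for part in (y_temp, y_train, y_val, y_test):
    print(round(part.mean(), 2))  # 0.16 each
```

Without stratification, a small validation set could end up with noticeably fewer attrited customers, which would distort recall estimates.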
As we saw earlier, our data has missing values. We will impute missing values using mode since all missing values are for categorical variables. We will use SimpleImputer to do this.
si2 = SimpleImputer(strategy='most_frequent')
mode_imputed_col = ['Education_Level', 'Marital_Status', 'Income_Category']
# Fit and transform the temp data (used as training data for cross-validated models)
X_temp[mode_imputed_col] = si2.fit_transform(X_temp[mode_imputed_col])
# Fit on the train data and transform it
X_train[mode_imputed_col] = si2.fit_transform(X_train[mode_imputed_col])
# Transform the validation data i.e. replace missing values with the mode calculated from the training data
X_val[mode_imputed_col] = si2.transform(X_val[mode_imputed_col])
# Transform the test data i.e. replace missing values with the mode calculated from the training data
X_test[mode_imputed_col] = si2.transform(X_test[mode_imputed_col])
X_train.isnull().sum()
Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
#Checking that no column has missing values in train , validation, or test sets
print(X_temp.isna().sum())
print('-'*30)
print(X_train.isna().sum())
print('-'*30)
print(X_val.isna().sum())
print('-'*30)
print(X_test.isna().sum())
Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64 ------------------------------ Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64 ------------------------------ Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64 ------------------------------ Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
Let's encode the ordered categorical variables as integers and create dummy variables for the remaining string type variables.
X_train.head()
|   | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 800 | 40 | M | 2 | Graduate | Single | $120K + | Blue | 21 | 6 | 4 | 3 | 20056.0 | 1602 | 18454.0 | 0.466 | 1687 | 46 | 0.533 | 0.080 |
| 498 | 44 | M | 1 | Graduate | Married | Less than $40K | Blue | 34 | 6 | 2 | 0 | 2885.0 | 1895 | 990.0 | 0.387 | 1366 | 31 | 0.632 | 0.657 |
| 4356 | 48 | M | 4 | High School | Married | $80K - $120K | Blue | 36 | 5 | 1 | 2 | 6798.0 | 2517 | 4281.0 | 0.873 | 4327 | 79 | 0.881 | 0.370 |
| 407 | 41 | M | 2 | Graduate | Married | $60K - $80K | Silver | 36 | 6 | 2 | 0 | 27000.0 | 0 | 27000.0 | 0.610 | 1209 | 39 | 0.300 | 0.000 |
| 8728 | 46 | M | 4 | High School | Divorced | $40K - $60K | Silver | 36 | 2 | 2 | 3 | 15034.0 | 1356 | 13678.0 | 0.754 | 7737 | 84 | 0.750 | 0.090 |
The following categorical variables have an inherent order, so we will encode them as ordinal integers:
- Income_Category
- Education_Level
- Card_Category
X_train['Income_Category'].unique()
array(['$120K +', 'Less than $40K', '$80K - $120K', '$60K - $80K',
'$40K - $60K'], dtype=object)
# Encode the Income_Category Column
Income_Category = {'Less than $40K' : 1, '$40K - $60K' : 2, '$60K - $80K' : 3, '$80K - $120K' : 4, '$120K +' : 5}
X_temp.replace({"Income_Category": Income_Category}, inplace=True)
X_train.replace({"Income_Category": Income_Category}, inplace=True)
X_val.replace({"Income_Category": Income_Category}, inplace=True)
X_test.replace({"Income_Category": Income_Category}, inplace=True)
X_train['Education_Level'].unique()
array(['Graduate', 'High School', 'Uneducated', 'College', 'Doctorate',
'Post-Graduate'], dtype=object)
# Encode the Education_Level Column
Education_Level = {'Uneducated' : 1, 'High School' : 2, 'College' : 3, 'Graduate' : 4, 'Post-Graduate' : 5 , 'Doctorate' : 6}
X_temp.replace({"Education_Level": Education_Level}, inplace=True)
X_train.replace({"Education_Level": Education_Level}, inplace=True)
X_val.replace({"Education_Level": Education_Level}, inplace=True)
X_test.replace({"Education_Level": Education_Level}, inplace=True)
X_train['Card_Category'].unique()
array(['Blue', 'Silver', 'Gold', 'Platinum'], dtype=object)
# Encode the Card_Category Column
Card_Category = {'Blue' : 1, 'Silver' : 2, 'Gold' : 3, 'Platinum' : 4}
X_temp.replace({"Card_Category": Card_Category}, inplace=True)
X_train.replace({"Card_Category": Card_Category}, inplace=True)
X_val.replace({"Card_Category": Card_Category}, inplace=True)
X_test.replace({"Card_Category": Card_Category}, inplace=True)
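Applying each mapping to the four frames line by line is repetitive; the three ordinal maps can be applied in a single loop. A minimal sketch on a toy frame (the dictionaries mirror the ones defined in the cells above):

```python
import pandas as pd

# The three ordinal maps defined above, gathered into one dictionary
ordinal_maps = {
    "Income_Category": {'Less than $40K': 1, '$40K - $60K': 2, '$60K - $80K': 3,
                        '$80K - $120K': 4, '$120K +': 5},
    "Education_Level": {'Uneducated': 1, 'High School': 2, 'College': 3,
                        'Graduate': 4, 'Post-Graduate': 5, 'Doctorate': 6},
    "Card_Category": {'Blue': 1, 'Silver': 2, 'Gold': 3, 'Platinum': 4},
}

# Toy frame standing in for X_temp / X_train / X_val / X_test
frame = pd.DataFrame({"Income_Category": ['$120K +', 'Less than $40K'],
                      "Education_Level": ['Graduate', 'Doctorate'],
                      "Card_Category": ['Blue', 'Silver']})

for df in (frame,):  # in the notebook this would be (X_temp, X_train, X_val, X_test)
    for col, mapping in ordinal_maps.items():
        df[col] = df[col].map(mapping)

print(frame.values.tolist())  # [[5, 4, 1], [1, 6, 2]]
```

`Series.map` leaves any value missing from the mapping as NaN, which would immediately surface an unexpected category rather than silently passing it through.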
X_train.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 6075 entries, 800 to 4035 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Customer_Age 6075 non-null int64 1 Gender 6075 non-null object 2 Dependent_count 6075 non-null int64 3 Education_Level 6075 non-null int64 4 Marital_Status 6075 non-null object 5 Income_Category 6075 non-null int64 6 Card_Category 6075 non-null int64 7 Months_on_book 6075 non-null int64 8 Total_Relationship_Count 6075 non-null int64 9 Months_Inactive_12_mon 6075 non-null int64 10 Contacts_Count_12_mon 6075 non-null int64 11 Credit_Limit 6075 non-null float64 12 Total_Revolving_Bal 6075 non-null int64 13 Avg_Open_To_Buy 6075 non-null float64 14 Total_Amt_Chng_Q4_Q1 6075 non-null float64 15 Total_Trans_Amt 6075 non-null int64 16 Total_Trans_Ct 6075 non-null int64 17 Total_Ct_Chng_Q4_Q1 6075 non-null float64 18 Avg_Utilization_Ratio 6075 non-null float64 dtypes: float64(5), int64(12), object(2) memory usage: 949.2+ KB
#converting object/string data types of columns to category
for column in ['Marital_Status', 'Gender']:
X_temp[column]=X_temp[column].astype('category')
X_train[column]=X_train[column].astype('category')
X_val[column]=X_val[column].astype('category')
X_test[column]=X_test[column].astype('category')
#List of columns to create dummy variables for
col_dummy=['Gender', 'Marital_Status']
#Encoding categorical variables
X_temp=pd.get_dummies(X_temp, columns=col_dummy, drop_first=True)
X_train=pd.get_dummies(X_train, columns=col_dummy, drop_first=True)
X_val=pd.get_dummies(X_val, columns=col_dummy, drop_first=True)
X_test=pd.get_dummies(X_test, columns=col_dummy, drop_first=True)
X_train.head()
| | Customer_Age | Dependent_count | Education_Level | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Gender_M | Marital_Status_Married | Marital_Status_Single |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 800 | 40 | 2 | 4 | 5 | 1 | 21 | 6 | 4 | 3 | 20056.0 | 1602 | 18454.0 | 0.466 | 1687 | 46 | 0.533 | 0.080 | 1 | 0 | 1 |
| 498 | 44 | 1 | 4 | 1 | 1 | 34 | 6 | 2 | 0 | 2885.0 | 1895 | 990.0 | 0.387 | 1366 | 31 | 0.632 | 0.657 | 1 | 1 | 0 |
| 4356 | 48 | 4 | 2 | 4 | 1 | 36 | 5 | 1 | 2 | 6798.0 | 2517 | 4281.0 | 0.873 | 4327 | 79 | 0.881 | 0.370 | 1 | 1 | 0 |
| 407 | 41 | 2 | 4 | 3 | 2 | 36 | 6 | 2 | 0 | 27000.0 | 0 | 27000.0 | 0.610 | 1209 | 39 | 0.300 | 0.000 | 1 | 1 | 0 |
| 8728 | 46 | 4 | 2 | 2 | 2 | 36 | 2 | 2 | 3 | 15034.0 | 1356 | 13678.0 | 0.754 | 7737 | 84 | 0.750 | 0.090 | 1 | 0 | 0 |
Let's create two functions, one to compute classification metrics and one to plot the confusion matrix, so that we don't have to repeat the same code for each model.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
from sklearn.metrics import f1_score  # used below; not included in the imports at the top
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"Accuracy": acc,
"Recall": recall,
"Precision": precision,
"F1": f1,
},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
#Fitting the model
m1_lg = LogisticRegression(solver="newton-cg", random_state=1)
m1_lg.fit(X_train,y_train)
#Calculating different metrics
m1_lg_model_train_perf=model_performance_classification_sklearn(m1_lg, X_train,y_train)
print("Training performance:\n", m1_lg_model_train_perf)
m1_lg_model_test_perf=model_performance_classification_sklearn(m1_lg, X_test,y_test)
print("Testing performance:\n", m1_lg_model_test_perf)
#Creating confusion matrix
confusion_matrix_sklearn(m1_lg,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.899588 0.552254 0.757022 0.638626
Testing performance:
Accuracy Recall Precision F1
0 0.904738 0.572308 0.775 0.658407
X_temp.head()
| | Customer_Age | Dependent_count | Education_Level | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Gender_M | Marital_Status_Married | Marital_Status_Single |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3105 | 53 | 2 | 4 | 2 | 1 | 37 | 5 | 3 | 4 | 7282.0 | 0 | 7282.0 | 0.740 | 3364 | 69 | 0.816 | 0.000 | 0 | 0 | 1 |
| 3721 | 44 | 4 | 4 | 1 | 1 | 37 | 4 | 3 | 3 | 5826.0 | 0 | 5826.0 | 0.689 | 3756 | 73 | 0.921 | 0.000 | 0 | 1 | 0 |
| 3389 | 50 | 3 | 1 | 1 | 1 | 41 | 4 | 2 | 2 | 2563.0 | 1860 | 703.0 | 0.680 | 3774 | 83 | 0.804 | 0.726 | 0 | 0 | 1 |
| 3552 | 50 | 1 | 4 | 4 | 1 | 30 | 6 | 2 | 3 | 9771.0 | 1776 | 7995.0 | 0.460 | 2778 | 53 | 0.472 | 0.182 | 1 | 1 | 0 |
| 398 | 55 | 0 | 2 | 5 | 1 | 49 | 5 | 3 | 3 | 3805.0 | 2233 | 1572.0 | 1.095 | 1743 | 27 | 0.929 | 0.587 | 1 | 1 | 0 |
#Fitting the model
m2_d_tree = DecisionTreeClassifier(random_state=1)
m2_d_tree.fit(X_train,y_train)
#Calculating different metrics
m2_d_tree_model_train_perf=model_performance_classification_sklearn(m2_d_tree, X_train,y_train)
print("Training performance:\n", m2_d_tree_model_train_perf)
m2_d_tree_model_test_perf=model_performance_classification_sklearn(m2_d_tree, X_test,y_test)
print("Testing performance:\n", m2_d_tree_model_test_perf)
#Creating confusion matrix
confusion_matrix_sklearn(m2_d_tree,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
Testing performance:
Accuracy Recall Precision F1
0 0.935834 0.830769 0.782609 0.80597
#We will build a bagging classifier with decision tree, which is the default base estimator
m3_bagging_estimator=BaggingClassifier(random_state=1)
m3_bagging_estimator.fit(X_train,y_train)
#Calculating different metrics
m3_bagging_estimator_train_perf=model_performance_classification_sklearn(m3_bagging_estimator,X_train,y_train)
print("Training performance:\n",m3_bagging_estimator_train_perf)
m3_bagging_estimator_test_perf=model_performance_classification_sklearn(m3_bagging_estimator,X_test,y_test)
print("Testing performance:\n",m3_bagging_estimator_test_perf)
confusion_matrix_sklearn(m3_bagging_estimator,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.997202 0.985656 0.996891 0.991242
Testing performance:
Accuracy Recall Precision F1
0 0.956565 0.861538 0.866873 0.864198
#Train the random forest classifier
m4_rf_estimator=RandomForestClassifier(random_state=1)
m4_rf_estimator.fit(X_train,y_train)
#Calculating different metrics
m4_rf_estimator_train_perf=model_performance_classification_sklearn(m4_rf_estimator,X_train,y_train)
print("Training performance:\n",m4_rf_estimator_train_perf)
m4_rf_estimator_test_perf=model_performance_classification_sklearn(m4_rf_estimator,X_test,y_test)
print("Testing performance:\n",m4_rf_estimator_test_perf)
confusion_matrix_sklearn(m4_rf_estimator,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
Testing performance:
Accuracy Recall Precision F1
0 0.956565 0.812308 0.907216 0.857143
m5_adaBoost_classfr = AdaBoostClassifier(random_state=1)
m5_adaBoost_classfr.fit(X_train,y_train)
#Calculating different metrics
m5_adaBoost_classfr_train_perf=model_performance_classification_sklearn(m5_adaBoost_classfr,X_train,y_train)
print("Training performance:\n",m5_adaBoost_classfr_train_perf)
m5_adaBoost_classfr_test_perf=model_performance_classification_sklearn(m5_adaBoost_classfr,X_test,y_test)
print("Testing performance:\n",m5_adaBoost_classfr_test_perf)
confusion_matrix_sklearn(m5_adaBoost_classfr,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.958519 0.847336 0.889247 0.867786
Testing performance:
Accuracy Recall Precision F1
0 0.968411 0.901538 0.901538 0.901538
m6_gradBoost_classfr = GradientBoostingClassifier(random_state=1)
m6_gradBoost_classfr.fit(X_train,y_train)
#Calculating different metrics
m6_gradBoost_classfr_train_perf=model_performance_classification_sklearn(m6_gradBoost_classfr,X_train,y_train)
print("Training performance:\n",m6_gradBoost_classfr_train_perf)
m6_gradBoost_classfr_test_perf=model_performance_classification_sklearn(m6_gradBoost_classfr,X_test,y_test)
print("Testing performance:\n",m6_gradBoost_classfr_test_perf)
confusion_matrix_sklearn(m6_gradBoost_classfr,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.974156 0.881148 0.954495 0.916356
Testing performance:
Accuracy Recall Precision F1
0 0.970879 0.886154 0.929032 0.907087
# training performance comparison
models_train_comp_df = pd.concat(
[m1_lg_model_train_perf.T, m2_d_tree_model_train_perf.T,m3_bagging_estimator_train_perf.T,m4_rf_estimator_train_perf.T,
m5_adaBoost_classfr_train_perf.T,m6_gradBoost_classfr_train_perf.T ],
axis=1,
)
models_train_comp_df.columns = [
"M1 : Logistic Regression",
"M2 : Decision Tree",
"M3 : Bagging with Decision Tree",
"M4 : Random Forest Estimator",
"M5 : AdaBoost Classifier",
"M6 : GradientBoost Classifier"
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | M1 : Logistic Regression | M2 : Decision Tree | M3 : Bagging with Decision Tree | M4 : Random Forest Estimator | M5 : AdaBoost Classifier | M6 : GradientBoost Classifier |
|---|---|---|---|---|---|---|
| Accuracy | 0.899588 | 1.0 | 0.997202 | 1.0 | 0.958519 | 0.974156 |
| Recall | 0.552254 | 1.0 | 0.985656 | 1.0 | 0.847336 | 0.881148 |
| Precision | 0.757022 | 1.0 | 0.996891 | 1.0 | 0.889247 | 0.954495 |
| F1 | 0.638626 | 1.0 | 0.991242 | 1.0 | 0.867786 | 0.916356 |
# testing performance comparison
models_test_comp_df = pd.concat(
[m1_lg_model_test_perf.T, m2_d_tree_model_test_perf.T,m3_bagging_estimator_test_perf.T,m4_rf_estimator_test_perf.T,
m5_adaBoost_classfr_test_perf.T,m6_gradBoost_classfr_test_perf.T ],
axis=1,
)
models_test_comp_df.columns = [
"M1 : Logistic Regression",
"M2 : Decision Tree",
"M3 : Bagging with Decision Tree",
"M4 : Random Forest Estimator",
"M5 : AdaBoost Classifier",
"M6 : GradientBoost Classifier"
]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
| | M1 : Logistic Regression | M2 : Decision Tree | M3 : Bagging with Decision Tree | M4 : Random Forest Estimator | M5 : AdaBoost Classifier | M6 : GradientBoost Classifier |
|---|---|---|---|---|---|---|
| Accuracy | 0.904738 | 0.935834 | 0.956565 | 0.956565 | 0.968411 | 0.970879 |
| Recall | 0.572308 | 0.830769 | 0.861538 | 0.812308 | 0.901538 | 0.886154 |
| Precision | 0.775000 | 0.782609 | 0.866873 | 0.907216 | 0.901538 | 0.929032 |
| F1 | 0.658407 | 0.805970 | 0.864198 | 0.857143 | 0.901538 | 0.907087 |
print("Before UpSampling, counts of label 'Yes': {}".format(sum(y_temp == 1)))
print("Before UpSampling, counts of label 'No': {} \n".format(sum(y_temp == 0)))
# Synthetic Minority Over-sampling Technique; imblearn is not imported at the top
from imblearn.over_sampling import SMOTE

sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After UpSampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After UpSampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))
print("After UpSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After UpSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before UpSampling, counts of label 'Yes': 1302
Before UpSampling, counts of label 'No': 6799

After UpSampling, counts of label 'Yes': 5099
After UpSampling, counts of label 'No': 5099

After UpSampling, the shape of train_X: (10198, 20)
After UpSampling, the shape of train_y: (10198,)
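SMOTE synthesizes new minority-class points by interpolating between a minority sample and its nearest minority neighbors. A simpler random-oversampling sketch with pandas illustrates the balancing idea (toy data only; duplicating rows, not equivalent to SMOTE's interpolation):

```python
import pandas as pd

# Toy imbalanced training set (illustrative only, not the bank data)
train = pd.DataFrame({"x": range(10), "y": [0] * 8 + [1] * 2})

majority = train[train["y"] == 0]
minority = train[train["y"] == 1]

# Randomly duplicate minority rows until both classes have equal counts
balanced = pd.concat([majority,
                      minority.sample(len(majority), replace=True, random_state=1)])

print(balanced["y"].value_counts().to_dict())  # {0: 8, 1: 8}
```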
#Fitting the model
m7_lg = LogisticRegression(solver="newton-cg", random_state=1)
m7_lg.fit(X_train_over,y_train_over)
#Calculating different metrics
m7_lg_model_train_perf=model_performance_classification_sklearn(m7_lg, X_train_over,y_train_over)
print("Training performance:\n", m7_lg_model_train_perf)
m7_lg_model_test_perf=model_performance_classification_sklearn(m7_lg, X_test,y_test)
print("Testing performance:\n", m7_lg_model_test_perf)
#Creating confusion matrix
confusion_matrix_sklearn(m7_lg,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.900863 0.899392 0.902046 0.900717
Testing performance:
Accuracy Recall Precision F1
0 0.884995 0.775385 0.61165 0.683853
#Fitting the model
m8_d_tree = DecisionTreeClassifier(random_state=1)
m8_d_tree.fit(X_train_over,y_train_over)
#Calculating different metrics
m8_d_tree_model_train_perf=model_performance_classification_sklearn(m8_d_tree, X_train_over,y_train_over)
print("Training performance:\n", m8_d_tree_model_train_perf)
m8_d_tree_model_test_perf=model_performance_classification_sklearn(m8_d_tree, X_test,y_test)
print("Testing performance:\n", m8_d_tree_model_test_perf)
#Creating confusion matrix
confusion_matrix_sklearn(m8_d_tree,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
Testing performance:
Accuracy Recall Precision F1
0 0.927937 0.873846 0.730077 0.795518
#We will build a bagging classifier with decision tree, which is the default base estimator
m9_bagging_estimator=BaggingClassifier(random_state=1)
m9_bagging_estimator.fit(X_train_over,y_train_over)
#Calculating different metrics
m9_bagging_estimator_train_perf=model_performance_classification_sklearn(m9_bagging_estimator,X_train_over,y_train_over)
print("Training performance:\n",m9_bagging_estimator_train_perf)
m9_bagging_estimator_test_perf=model_performance_classification_sklearn(m9_bagging_estimator,X_test,y_test)
print("Testing performance:\n",m9_bagging_estimator_test_perf)
confusion_matrix_sklearn(m9_bagging_estimator,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.998529 0.99745 0.999607 0.998528
Testing performance:
Accuracy Recall Precision F1
0 0.946199 0.876923 0.805085 0.83947
#Train the random forest classifier
m10_rf_estimator=RandomForestClassifier(random_state=1)
m10_rf_estimator.fit(X_train_over,y_train_over)
#Calculating different metrics
m10_rf_estimator_train_perf=model_performance_classification_sklearn(m10_rf_estimator,X_train_over,y_train_over)
print("Training performance:\n",m10_rf_estimator_train_perf)
m10_rf_estimator_test_perf=model_performance_classification_sklearn(m10_rf_estimator,X_test,y_test)
print("Testing performance:\n",m10_rf_estimator_test_perf)
confusion_matrix_sklearn(m10_rf_estimator,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
Testing performance:
Accuracy Recall Precision F1
0 0.960513 0.901538 0.859238 0.87988
m11_adaBoost_classfr = AdaBoostClassifier(random_state=1)
m11_adaBoost_classfr.fit(X_train_over,y_train_over)
#Calculating different metrics
m11_adaBoost_classfr_train_perf=model_performance_classification_sklearn(m11_adaBoost_classfr,X_train_over,y_train_over)
print("Training performance:\n",m11_adaBoost_classfr_train_perf)
m11_adaBoost_classfr_test_perf=model_performance_classification_sklearn(m11_adaBoost_classfr,X_test,y_test)
print("Testing performance:\n",m11_adaBoost_classfr_test_perf)
confusion_matrix_sklearn(m11_adaBoost_classfr,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.956462 0.961169 0.952205 0.956666
Testing performance:
Accuracy Recall Precision F1
0 0.94077 0.938462 0.753086 0.835616
m12_gradBoost_classfr = GradientBoostingClassifier(random_state=1)
m12_gradBoost_classfr.fit(X_train_over,y_train_over)
#Calculating different metrics
m12_gradBoost_classfr_train_perf=model_performance_classification_sklearn(m12_gradBoost_classfr,X_train_over,y_train_over)
print("Training performance:\n",m12_gradBoost_classfr_train_perf)
m12_gradBoost_classfr_test_perf=model_performance_classification_sklearn(m12_gradBoost_classfr,X_test,y_test)
print("Testing performance:\n",m12_gradBoost_classfr_test_perf)
confusion_matrix_sklearn(m12_gradBoost_classfr,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.97323 0.979212 0.967636 0.973389
Testing performance:
Accuracy Recall Precision F1
0 0.957058 0.947692 0.814815 0.876245
# training performance comparison
models_train_comp_df_os = pd.concat(
[m7_lg_model_train_perf.T, m8_d_tree_model_train_perf.T,m9_bagging_estimator_train_perf.T,m10_rf_estimator_train_perf.T,
m11_adaBoost_classfr_train_perf.T,m12_gradBoost_classfr_train_perf.T ],
axis=1,
)
models_train_comp_df_os.columns = [
"M7 : Logistic Regression (OverSampling)",
"M8 : Decision Tree (OverSampling)",
"M9 : Bagging with Decision Tree (OverSampling)",
"M10 : Random Forest Estimator (OverSampling)",
"M11 : AdaBoost Classifier (OverSampling)",
"M12: GradientBoost Classifier (OverSampling)"
]
print("Training performance comparison (OverSampling):")
models_train_comp_df_os
Training performance comparison (OverSampling):
| | M7 : Logistic Regression (OverSampling) | M8 : Decision Tree (OverSampling) | M9 : Bagging with Decision Tree (OverSampling) | M10 : Random Forest Estimator (OverSampling) | M11 : AdaBoost Classifier (OverSampling) | M12: GradientBoost Classifier (OverSampling) |
|---|---|---|---|---|---|---|
| Accuracy | 0.900863 | 1.0 | 0.998529 | 1.0 | 0.956462 | 0.973230 |
| Recall | 0.899392 | 1.0 | 0.997450 | 1.0 | 0.961169 | 0.979212 |
| Precision | 0.902046 | 1.0 | 0.999607 | 1.0 | 0.952205 | 0.967636 |
| F1 | 0.900717 | 1.0 | 0.998528 | 1.0 | 0.956666 | 0.973389 |
# testing performance comparison
models_test_comp_df_os = pd.concat(
[m7_lg_model_test_perf.T, m8_d_tree_model_test_perf.T,m9_bagging_estimator_test_perf.T,m10_rf_estimator_test_perf.T,
m11_adaBoost_classfr_test_perf.T,m12_gradBoost_classfr_test_perf.T ],
axis=1,
)
models_test_comp_df_os.columns = [
"M7 : Logistic Regression (OverSampling)",
"M8 : Decision Tree (OverSampling)",
"M9 : Bagging with Decision Tree (OverSampling)",
"M10: Random Forest Estimator (OverSampling)",
"M11: AdaBoost Classifier (OverSampling)",
"M12: GradientBoost Classifier (OverSampling)"
]
print("Testing performance comparison (OverSampling):")
models_test_comp_df_os
Testing performance comparison (OverSampling):
| | M7 : Logistic Regression (OverSampling) | M8 : Decision Tree (OverSampling) | M9 : Bagging with Decision Tree (OverSampling) | M10: Random Forest Estimator (OverSampling) | M11: AdaBoost Classifier (OverSampling) | M12: GradientBoost Classifier (OverSampling) |
|---|---|---|---|---|---|---|
| Accuracy | 0.884995 | 0.927937 | 0.946199 | 0.960513 | 0.940770 | 0.957058 |
| Recall | 0.775385 | 0.873846 | 0.876923 | 0.901538 | 0.938462 | 0.947692 |
| Precision | 0.611650 | 0.730077 | 0.805085 | 0.859238 | 0.753086 | 0.814815 |
| F1 | 0.683853 | 0.795518 | 0.839470 | 0.879880 | 0.835616 | 0.876245 |
# Cluster Centroids based undersampling; imblearn is not imported at the top
from imblearn.under_sampling import ClusterCentroids

cc = ClusterCentroids()
X_train_under, y_train_under = cc.fit_resample(X_train, y_train)
print("Before DownSampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before DownSampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
print("After DownSampling, counts of label 'Yes': {}".format(sum(y_train_under == 1)))
print("After DownSampling, counts of label 'No': {} \n".format(sum(y_train_under == 0)))
print("After DownSampling, the shape of train_X: {}".format(X_train_under.shape))
print("After DownSampling, the shape of train_y: {} \n".format(y_train_under.shape))
Before DownSampling, counts of label 'Yes': 976
Before DownSampling, counts of label 'No': 5099

After DownSampling, counts of label 'Yes': 976
After DownSampling, counts of label 'No': 976

After DownSampling, the shape of train_X: (1952, 20)
After DownSampling, the shape of train_y: (1952,)
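ClusterCentroids replaces majority-class points with K-means centroids so the retained points summarize the majority class. A simpler random-undersampling sketch with pandas shows the balancing idea (toy data only; dropping rows, not equivalent to the centroid approach):

```python
import pandas as pd

# Toy imbalanced training set (illustrative only, not the bank data)
train = pd.DataFrame({"x": range(10), "y": [0] * 8 + [1] * 2})

majority = train[train["y"] == 0]
minority = train[train["y"] == 1]

# Randomly keep only as many majority rows as there are minority rows
balanced = pd.concat([majority.sample(len(minority), random_state=1), minority])

print(balanced["y"].value_counts().to_dict())  # {0: 2, 1: 2}
```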
#Fitting the model
m13_lg = LogisticRegression(solver="newton-cg", random_state=1)
m13_lg.fit(X_train_under,y_train_under)
#Calculating different metrics
m13_lg_model_train_perf=model_performance_classification_sklearn(m13_lg, X_train_under,y_train_under)
print("Training performance:\n", m13_lg_model_train_perf)
m13_lg_model_test_perf=model_performance_classification_sklearn(m13_lg, X_test,y_test)
print("Testing performance:\n", m13_lg_model_test_perf)
#Creating confusion matrix
confusion_matrix_sklearn(m13_lg,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.941598 0.940574 0.942505 0.941538
Testing performance:
Accuracy Recall Precision F1
0 0.594274 0.950769 0.27713 0.429167
#Fitting the model
m14_d_tree = DecisionTreeClassifier(random_state=1)
m14_d_tree.fit(X_train_under,y_train_under)
#Calculating different metrics
m14_d_tree_model_train_perf=model_performance_classification_sklearn(m14_d_tree, X_train_under,y_train_under)
print("Training performance:\n", m14_d_tree_model_train_perf)
m14_d_tree_model_test_perf=model_performance_classification_sklearn(m14_d_tree, X_test,y_test)
print("Testing performance:\n", m14_d_tree_model_test_perf)
#Creating confusion matrix
confusion_matrix_sklearn(m14_d_tree,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
Testing performance:
Accuracy Recall Precision F1
0 0.721125 0.932308 0.358156 0.517506
#We will build a bagging classifier with decision tree, which is the default base estimator
m15_bagging_estimator=BaggingClassifier(random_state=1)
m15_bagging_estimator.fit(X_train_under,y_train_under)
#Calculating different metrics
m15_bagging_estimator_train_perf=model_performance_classification_sklearn(m15_bagging_estimator,X_train_under,y_train_under)
print("Training performance:\n",m15_bagging_estimator_train_perf)
m15_bagging_estimator_test_perf=model_performance_classification_sklearn(m15_bagging_estimator,X_test,y_test)
print("Testing performance:\n",m15_bagging_estimator_test_perf)
confusion_matrix_sklearn(m15_bagging_estimator,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.996414 0.994877 0.997945 0.996408
Testing performance:
Accuracy Recall Precision F1
0 0.780849 0.966154 0.420348 0.585821
#Train the random forest classifier
m16_rf_estimator=RandomForestClassifier(random_state=1)
m16_rf_estimator.fit(X_train_under,y_train_under)
#Calculating different metrics
m16_rf_estimator_train_perf=model_performance_classification_sklearn(m16_rf_estimator,X_train_under,y_train_under)
print("Training performance:\n",m16_rf_estimator_train_perf)
m16_rf_estimator_test_perf=model_performance_classification_sklearn(m16_rf_estimator,X_test,y_test)
print("Testing performance:\n",m16_rf_estimator_test_perf)
confusion_matrix_sklearn(m16_rf_estimator,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
Testing performance:
Accuracy Recall Precision F1
0 0.69694 0.978462 0.343784 0.5088
m17_adaBoost_classfr = AdaBoostClassifier(random_state=1)
m17_adaBoost_classfr.fit(X_train_under,y_train_under)
#Calculating different metrics
m17_adaBoost_classfr_train_perf=model_performance_classification_sklearn(m17_adaBoost_classfr,X_train_under,y_train_under)
print("Training performance:\n",m17_adaBoost_classfr_train_perf)
m17_adaBoost_classfr_test_perf=model_performance_classification_sklearn(m17_adaBoost_classfr,X_test,y_test)
print("Testing performance:\n",m17_adaBoost_classfr_test_perf)
confusion_matrix_sklearn(m17_adaBoost_classfr,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.985656 0.987705 0.983673 0.985685
Testing performance:
Accuracy Recall Precision F1
0 0.721619 0.978462 0.363429 0.53
m18_gradBoost_classfr = GradientBoostingClassifier(random_state=1)
m18_gradBoost_classfr.fit(X_train_under,y_train_under)
#Calculating different metrics
m18_gradBoost_classfr_train_perf=model_performance_classification_sklearn(m18_gradBoost_classfr,X_train_under,y_train_under)
print("Training performance:\n",m18_gradBoost_classfr_train_perf)
m18_gradBoost_classfr_test_perf=model_performance_classification_sklearn(m18_gradBoost_classfr,X_test,y_test)
print("Testing performance:\n",m18_gradBoost_classfr_test_perf)
confusion_matrix_sklearn(m18_gradBoost_classfr,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.996926 0.996926 0.996926 0.996926
Testing performance:
Accuracy Recall Precision F1
0 0.674729 0.984615 0.328542 0.492687
# training performance comparison
models_train_comp_df_us = pd.concat(
[m13_lg_model_train_perf.T, m14_d_tree_model_train_perf.T,m15_bagging_estimator_train_perf.T,m16_rf_estimator_train_perf.T,
m17_adaBoost_classfr_train_perf.T,m18_gradBoost_classfr_train_perf.T ],
axis=1,
)
models_train_comp_df_us.columns = [
"M13 : Logistic Regression (UnderSampling)",
"M14 : Decision Tree (UnderSampling)",
"M15 : Bagging with Decision Tree (UnderSampling)",
"M16 : Random Forest Estimator (UnderSampling)",
"M17 : AdaBoost Classifier (UnderSampling)",
"M18 : GradientBoost Classifier (UnderSampling)"
]
print("Training performance comparison (UnderSampling):")
models_train_comp_df_us
Training performance comparison (UnderSampling):
| | M13 : Logistic Regression (UnderSampling) | M14 : Decision Tree (UnderSampling) | M15 : Bagging with Decision Tree (UnderSampling) | M16 : Random Forest Estimator (UnderSampling) | M17 : AdaBoost Classifier (UnderSampling) | M18 : GradientBoost Classifier (UnderSampling) |
|---|---|---|---|---|---|---|
| Accuracy | 0.941598 | 1.0 | 0.996414 | 1.0 | 0.985656 | 0.996926 |
| Recall | 0.940574 | 1.0 | 0.994877 | 1.0 | 0.987705 | 0.996926 |
| Precision | 0.942505 | 1.0 | 0.997945 | 1.0 | 0.983673 | 0.996926 |
| F1 | 0.941538 | 1.0 | 0.996408 | 1.0 | 0.985685 | 0.996926 |
# testing performance comparison
models_test_comp_df_us = pd.concat(
[m13_lg_model_test_perf.T, m14_d_tree_model_test_perf.T,m15_bagging_estimator_test_perf.T,m16_rf_estimator_test_perf.T,
m17_adaBoost_classfr_test_perf.T,m18_gradBoost_classfr_test_perf.T ],
axis=1,
)
models_test_comp_df_us.columns = [
"M13 : Logistic Regression (UnderSampling)",
"M14 : Decision Tree (UnderSampling)",
"M15 : Bagging with Decision Tree (UnderSampling)",
"M16 : Random Forest Estimator (UnderSampling)",
"M17 : AdaBoost Classifier (UnderSampling)",
"M18 : GradientBoost Classifier (UnderSampling)"]
print("Testing performance comparison (UnderSampling):")
models_test_comp_df_us
Testing performance comparison (UnderSampling):
| | M13 : Logistic Regression (UnderSampling) | M14 : Decision Tree (UnderSampling) | M15 : Bagging with Decision Tree (UnderSampling) | M16 : Random Forest Estimator (UnderSampling) | M17 : AdaBoost Classifier (UnderSampling) | M18 : GradientBoost Classifier (UnderSampling) |
|---|---|---|---|---|---|---|
| Accuracy | 0.594274 | 0.721125 | 0.780849 | 0.696940 | 0.721619 | 0.674729 |
| Recall | 0.950769 | 0.932308 | 0.966154 | 0.978462 | 0.978462 | 0.984615 |
| Precision | 0.277130 | 0.358156 | 0.420348 | 0.343784 | 0.363429 | 0.328542 |
| F1 | 0.429167 | 0.517506 | 0.585821 | 0.508800 | 0.530000 | 0.492687 |
# training performance comparison
models_train_comp_best = pd.concat(
[m5_adaBoost_classfr_train_perf.T,m6_gradBoost_classfr_train_perf.T , m11_adaBoost_classfr_train_perf.T,
m12_gradBoost_classfr_train_perf.T ],
axis=1,
)
models_train_comp_best.columns = [
"M5 : AdaBoost Classifier (Without Sampling)",
"M6 : GradientBoost Classifier (Without Sampling)",
"M11 : AdaBoost Classifier (OverSampling)",
"M12: GradientBoost Classifier (OverSampling)"
]
print("Training performance comparison:")
models_train_comp_best
Training performance comparison:
| | M5 : AdaBoost Classifier (Without Sampling) | M6 : GradientBoost Classifier (Without Sampling) | M11 : AdaBoost Classifier (OverSampling) | M12: GradientBoost Classifier (OverSampling) |
|---|---|---|---|---|
| Accuracy | 0.958519 | 0.974156 | 0.956462 | 0.973230 |
| Recall | 0.847336 | 0.881148 | 0.961169 | 0.979212 |
| Precision | 0.889247 | 0.954495 | 0.952205 | 0.967636 |
| F1 | 0.867786 | 0.916356 | 0.956666 | 0.973389 |
# testing performance comparison
models_test_comp_best = pd.concat(
[ m5_adaBoost_classfr_test_perf.T,m6_gradBoost_classfr_test_perf.T , m11_adaBoost_classfr_test_perf.T,
m12_gradBoost_classfr_test_perf.T ],
axis=1,
)
models_test_comp_best.columns = [
"M5 : AdaBoost Classifier (Without Sampling)",
"M6 : GradientBoost Classifier (Without Sampling)",
"M11 : AdaBoost Classifier (OverSampling)",
"M12: GradientBoost Classifier (OverSampling)"]
print("Testing performance comparison")
models_test_comp_best
Testing performance comparison
| | M5 : AdaBoost Classifier (Without Sampling) | M6 : GradientBoost Classifier (Without Sampling) | M11 : AdaBoost Classifier (OverSampling) | M12: GradientBoost Classifier (OverSampling) |
|---|---|---|---|---|
| Accuracy | 0.968411 | 0.970879 | 0.940770 | 0.957058 |
| Recall | 0.901538 | 0.886154 | 0.938462 | 0.947692 |
| Precision | 0.901538 | 0.929032 | 0.753086 | 0.814815 |
| F1 | 0.901538 | 0.907087 | 0.835616 | 0.876245 |
%%time
# defining model
gradBoost_classfr_tuned = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in GridSearchCV
param_grid = {"n_estimators": [100,150,200,250],
"subsample":[0.8,0.9,1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV; not included in the imports at the top
from sklearn.model_selection import RandomizedSearchCV

# Note: the grid has only 4 x 3 = 12 combinations, so n_iter=200 evaluates all of them
randomized_cv = RandomizedSearchCV(estimator=gradBoost_classfr_tuned, param_distributions=param_grid, n_jobs = -1, n_iter=200, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 250} with CV score=0.8493511250654109:
CPU times: user 3.19 s, sys: 162 ms, total: 3.35 s
Wall time: 19.8 s
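The grid above contains only 4 × 3 = 12 parameter combinations, so the 200-iteration randomized search effectively degenerates to an exhaustive grid search. sklearn's `ParameterSampler`, which `RandomizedSearchCV` uses internally, makes this visible: with a finite grid it samples without replacement and caps the draw at the grid size.

```python
from sklearn.model_selection import ParameterSampler

param_grid = {"n_estimators": [100, 150, 200, 250], "subsample": [0.8, 0.9, 1]}

# n_iter exceeds the 12-combination grid, so the sampler returns the whole grid
# (sklearn also emits a warning about this)
sampled = list(ParameterSampler(param_grid, n_iter=200, random_state=1))
print(len(sampled))  # 12
```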
#Calculating different metrics
gradBoost_classfr_tuned_train_perf=model_performance_classification_sklearn(randomized_cv,X_train_over,y_train_over)
print("Training performance:\n",gradBoost_classfr_tuned_train_perf)
gradBoost_classfr_tuned_val_perf=model_performance_classification_sklearn(randomized_cv,X_val,y_val)
print("\nValidation performance:\n",gradBoost_classfr_tuned_val_perf)
gradBoost_classfr_tuned_test_perf=model_performance_classification_sklearn(randomized_cv,X_test,y_test)
print("\nTesting performance:\n",gradBoost_classfr_tuned_test_perf)
confusion_matrix_sklearn(randomized_cv,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.958325 0.919592 0.996811 0.956646
Validation performance:
Accuracy Recall Precision F1
0 0.973346 0.898773 0.933121 0.915625
Testing performance:
Accuracy Recall Precision F1
0 0.977789 0.92 0.940252 0.930016
%%time
# defining model
gradBoost_classfr_tuned_os = GradientBoostingClassifier(random_state=1)
param_grid = {"n_estimators": [100,150,200,250],
"subsample":[0.8,0.9,1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
gb_randomized_cv_os = RandomizedSearchCV(estimator=gradBoost_classfr_tuned_os, param_distributions=param_grid, n_jobs = -1, n_iter=200, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
gb_randomized_cv_os.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(gb_randomized_cv_os.best_params_,gb_randomized_cv_os.best_score_))
Best parameters are {'subsample': 0.8, 'n_estimators': 250} with CV score=0.9698025746117878:
CPU times: user 4.15 s, sys: 29.3 ms, total: 4.18 s
Wall time: 27.9 s
#Calculating different metrics
gradBoost_classfr_os_tuned_train_perf=model_performance_classification_sklearn(gb_randomized_cv_os,X_train_over,y_train_over)
print("Training performance:\n",gradBoost_classfr_os_tuned_train_perf)
gradBoost_classfr_tuned_os_val_perf=model_performance_classification_sklearn(gb_randomized_cv_os,X_val,y_val)
print("\nValidation performance:\n",gradBoost_classfr_tuned_os_val_perf)
gradBoost_classfr_os_tuned_test_perf=model_performance_classification_sklearn(gb_randomized_cv_os,X_test,y_test)
print("\nTesting performance:\n",gradBoost_classfr_os_tuned_test_perf)
confusion_matrix_sklearn(gb_randomized_cv_os,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.988429 0.992155 0.984816 0.988472
Validation performance:
Accuracy Recall Precision F1
0 0.964956 0.920245 0.869565 0.894188
Testing performance:
Accuracy Recall Precision F1
0 0.969398 0.953846 0.868347 0.909091
%%time
# defining model
adaboost_classfr_os_tuned= AdaBoostClassifier(random_state=1)
# Parameter grid to pass in GridSearchCV
param_grid = {
"n_estimators": np.arange(10, 110, 10),
"learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
adaboost_classfr_os_tuned_random_cv = RandomizedSearchCV(estimator=adaboost_classfr_os_tuned, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
adaboost_classfr_os_tuned_random_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(adaboost_classfr_os_tuned_random_cv.best_params_,adaboost_classfr_os_tuned_random_cv.best_score_))
Best parameters are {'n_estimators': 90, 'learning_rate': 0.2, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.9719588600896689:
CPU times: user 2.69 s, sys: 113 ms, total: 2.81 s
Wall time: 34.1 s
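The `metrics.make_scorer(metrics.recall_score)` call used above wraps a plain metric function into a scorer object that `RandomizedSearchCV` can apply to each cross-validation fold. A standalone illustration on toy imbalanced data (all names and sizes are made up for the demo):

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Wrap recall into a scorer usable by any sklearn cross-validation utility
scorer = metrics.make_scorer(metrics.recall_score)

# Toy imbalanced data, roughly 80/20 classes
X_demo, y_demo = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=1)
clf = DecisionTreeClassifier(max_depth=2, random_state=1)

# Each fold is now scored by recall instead of the default accuracy
recalls = cross_val_score(clf, X_demo, y_demo, scoring=scorer, cv=5)
```

Optimizing for recall matches the business goal here: a missed churner (false negative) costs the bank more than a false alarm.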
# building model with best parameters
adb_tuned2 = AdaBoostClassifier(
n_estimators=90,
learning_rate=0.2,
random_state=1,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
# Fit the model on training data
adb_tuned2.fit(X_train_over, y_train_over)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
random_state=1),
learning_rate=0.2, n_estimators=90, random_state=1)
#Calculating different metrics
adaBoost_classfr_tuned_train_perf=model_performance_classification_sklearn(adb_tuned2,X_train_over,y_train_over)
print("Training performance:\n",adaBoost_classfr_tuned_train_perf)
adaBoost_classfr_tuned_val_perf=model_performance_classification_sklearn(adb_tuned2,X_val,y_val)
print("\nValidation performance:\n",adaBoost_classfr_tuned_val_perf)
adaBoost_classfr_tuned_test_perf=model_performance_classification_sklearn(adb_tuned2,X_test,y_test)
print("\nTesting performance:\n",adaBoost_classfr_tuned_test_perf)
confusion_matrix_sklearn(adb_tuned2,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.988037 0.990783 0.985372 0.98807
Validation performance:
Accuracy Recall Precision F1
0 0.963968 0.917178 0.866667 0.891207
Testing performance:
Accuracy Recall Precision F1
0 0.966436 0.950769 0.855956 0.900875
# training performance comparison
models_train_comp_best = pd.concat(
[gradBoost_classfr_tuned_train_perf.T , adaBoost_classfr_tuned_train_perf.T , gradBoost_classfr_os_tuned_train_perf.T],
axis=1,
)
models_train_comp_best.columns = [
"Tuned GradientBoost Classifier (Without Sampling)",
"Tuned AdaBoost Classifier (OverSampling)",
"Tuned GradientBoost Classifier (OverSampling)"
]
print("Training performance comparison:")
models_train_comp_best
Training performance comparison:
| | Tuned GradientBoost Classifier (Without Sampling) | Tuned AdaBoost Classifier (OverSampling) | Tuned GradientBoost Classifier (OverSampling) |
|---|---|---|---|
| Accuracy | 0.958325 | 0.988037 | 0.988429 |
| Recall | 0.919592 | 0.990783 | 0.992155 |
| Precision | 0.996811 | 0.985372 | 0.984816 |
| F1 | 0.956646 | 0.988070 | 0.988472 |
# testing performance comparison
models_test_comp_best = pd.concat(
[ gradBoost_classfr_tuned_test_perf.T, adaBoost_classfr_tuned_test_perf.T,
gradBoost_classfr_os_tuned_test_perf.T ],
axis=1,
)
models_test_comp_best.columns = [
"Tuned GradientBoost Classifier (Without Sampling)",
"Tuned AdaBoost Classifier (OverSampling)",
"Tuned GradientBoost Classifier (OverSampling)"]
print("Testing performance comparison")
models_test_comp_best
Testing performance comparison
| | Tuned GradientBoost Classifier (Without Sampling) | Tuned AdaBoost Classifier (OverSampling) | Tuned GradientBoost Classifier (OverSampling) |
|---|---|---|---|
| Accuracy | 0.977789 | 0.966436 | 0.969398 |
| Recall | 0.920000 | 0.950769 | 0.953846 |
| Precision | 0.940252 | 0.855956 | 0.868347 |
| F1 | 0.930016 | 0.900875 | 0.909091 |
# Fit the best model with the best parameters
gradBoost_classfr_tuned_os_model = GradientBoostingClassifier(random_state=1, subsample=0.8, n_estimators=250)
gradBoost_classfr_tuned_os_model.fit(X_train_over,y_train_over)
gradBoost_classfr_os_tuned_train_perf=model_performance_classification_sklearn(gradBoost_classfr_tuned_os_model,X_train_over,y_train_over)
print("Training performance:\n",gradBoost_classfr_os_tuned_train_perf)
gradBoost_classfr_tuned_os_val_perf=model_performance_classification_sklearn(gradBoost_classfr_tuned_os_model,X_val,y_val)
print("\nValidation performance:\n",gradBoost_classfr_tuned_os_val_perf)
gradBoost_classfr_os_tuned_test_perf=model_performance_classification_sklearn(gradBoost_classfr_tuned_os_model,X_test,y_test)
print("\nTesting performance:\n",gradBoost_classfr_os_tuned_test_perf)
confusion_matrix_sklearn(gradBoost_classfr_tuned_os_model,X_test,y_test)
Training performance:
Accuracy Recall Precision F1
0 0.988429 0.992155 0.984816 0.988472
Validation performance:
Accuracy Recall Precision F1
0 0.964956 0.920245 0.869565 0.894188
Testing performance:
Accuracy Recall Precision F1
0 0.969398 0.953846 0.868347 0.909091
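The `confusion_matrix_sklearn` calls above rely on another helper defined earlier in the notebook. A minimal sketch of such a plotting helper (an assumption about its shape, not the notebook's exact definition), with a quick check on toy data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs in scripts
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix

def confusion_matrix_sklearn(model, predictors, target):
    """Plot the confusion matrix as a heatmap annotated with counts and percentages."""
    pred = model.predict(predictors)
    cm = confusion_matrix(target, pred)
    labels = np.asarray(
        ["{0:0.0f}\n{1:.2%}".format(item, item / cm.sum()) for item in cm.flatten()]
    ).reshape(cm.shape)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    return cm  # returned here for easy checking; the notebook's helper may only plot

# quick check on toy data (names are illustrative)
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(random_state=1)
cm_demo = confusion_matrix_sklearn(
    DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo), X_demo, y_demo
)
```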
For categorical columns, we will apply one-hot encoding and missing-value imputation as pre-processing steps.
We impute missing values across the whole dataset so that any missing values appearing in future data are handled automatically.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Attrition_Flag            10127 non-null  int64
 1   Customer_Age              10127 non-null  int64
 2   Gender                    10127 non-null  object
 3   Dependent_count           10127 non-null  int64
 4   Education_Level           8608 non-null   object
 5   Marital_Status            9378 non-null   object
 6   Income_Category           9015 non-null   object
 7   Card_Category             10127 non-null  object
 8   Months_on_book            10127 non-null  int64
 9   Total_Relationship_Count  10127 non-null  int64
 10  Months_Inactive_12_mon    10127 non-null  int64
 11  Contacts_Count_12_mon     10127 non-null  int64
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64
 14  Avg_Open_To_Buy           10127 non-null  float64
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 16  Total_Trans_Amt           10127 non-null  int64
 17  Total_Trans_Ct            10127 non-null  int64
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 19  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(5)
memory usage: 1.5+ MB
# creating a list of numerical variables
numerical_features = [
"Customer_Age",
"Dependent_count",
"Months_on_book",
"Total_Relationship_Count",
"Months_Inactive_12_mon",
"Contacts_Count_12_mon",
"Credit_Limit",
"Total_Revolving_Bal",
"Avg_Open_To_Buy",
"Total_Amt_Chng_Q4_Q1",
"Total_Trans_Amt",
"Total_Trans_Ct",
"Total_Ct_Chng_Q4_Q1",
"Avg_Utilization_Ratio"
]
# imports for the preprocessing pipeline (not loaded in the header cell)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# creating a transformer for numerical variables, which will apply a simple imputer to the numerical variables
numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])
# creating a list of categorical variables
categorical_features = ["Gender", "Education_Level","Marital_Status" ,"Income_Category","Card_Category"]
# creating a transformer for categorical variables, which will first apply simple imputer and
#then do one hot encoding for categorical variables
categorical_transformer = Pipeline(
steps=[
("imputer", SimpleImputer(strategy="most_frequent")),
("onehot", OneHotEncoder(handle_unknown="ignore")),
]
)
# handle_unknown = "ignore", allows model to handle any unknown category in the test data
# combining categorical transformer and numerical transformer using a column transformer
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numerical_features),
("cat", categorical_transformer, categorical_features),
],
remainder="passthrough",
)
# remainder = "passthrough" has been used, it will allow variables that are present in original data
# but not in "numerical_columns" and "categorical_columns" to pass through the column transformer without any changes
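A quick way to sanity-check a `ColumnTransformer` like the one above is to run a miniature version on a toy frame. In this hypothetical example (column names borrowed for familiarity, values made up), row 1 has both a numeric and a categorical gap, and the output shows the median and mode imputations applied before one-hot encoding:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame with one numeric and one categorical column, each with a gap
toy = pd.DataFrame({
    "Credit_Limit": [1000.0, np.nan, 3000.0, 2000.0],
    "Gender": ["F", np.nan, "M", "F"],
})
pre = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("imputer", SimpleImputer(strategy="median"))]), ["Credit_Limit"]),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Gender"]),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # force a dense array for easy inspection
)
out = pre.fit_transform(toy)
# Row 1: Credit_Limit imputed with the median (2000.0), Gender with the mode ("F")
```

The output has three columns: the imputed numeric column followed by the two one-hot columns for `Gender`.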
# Separating target variable and other variables
X = data.drop(columns="Attrition_Flag")
Y = data["Attrition_Flag"]
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1, stratify=Y
)
print(X_train.shape, X_test.shape)
(7088, 19) (3039, 19)
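The `stratify=Y` argument above keeps the class ratio of the target nearly identical in the train and test splits, which matters on imbalanced data like churn. A small self-contained illustration (the 84/16 split below is a made-up proportion for the demo):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative 100-row target with a minority class
y_demo = pd.Series([0] * 84 + [1] * 16)
X_demo = pd.DataFrame({"feature": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1, stratify=y_demo
)
# Both splits now carry roughly the same fraction of positives
```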
# Creating a new pipeline with the best parameters
# Note: imblearn's Pipeline (not sklearn's) is required here, since SMOTE resamples during fit only
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
model = Pipeline(
steps=[
("pre", preprocessor),
("SMOTE", SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)),
(
"GBC",
GradientBoostingClassifier(
random_state=1,
subsample= 0.8,
n_estimators= 250,
),
),
]
)
# Fit the model on training data
model.fit(X_train, y_train)
Pipeline(steps=[('pre',
ColumnTransformer(remainder='passthrough',
transformers=[('num',
Pipeline(steps=[('imputer',
SimpleImputer(strategy='median'))]),
['Customer_Age',
'Dependent_count',
'Months_on_book',
'Total_Relationship_Count',
'Months_Inactive_12_mon',
'Contacts_Count_12_mon',
'Credit_Limit',
'Total_Revolving_Bal',
'Avg_Open_To_Buy',
'Total_Amt_Chng_Q4_...
'Total_Trans_Ct',
'Total_Ct_Chng_Q4_Q1',
'Avg_Utilization_Ratio']),
('cat',
Pipeline(steps=[('imputer',
SimpleImputer(strategy='most_frequent')),
('onehot',
OneHotEncoder(handle_unknown='ignore'))]),
['Gender', 'Education_Level',
'Marital_Status',
'Income_Category',
'Card_Category'])])),
('GBC',
GradientBoostingClassifier(n_estimators=250, random_state=1,
subsample=0.8))])
# Plot the feature importances
feature_names = X_train_over.columns
importances = gradBoost_classfr_tuned_os_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(20, 20))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
data1 = data.copy()
data1.replace({"Attrition_Flag": {0:'Existing Customer' , 1:'Attrited Customer' }}, inplace=True)
plt.figure(figsize=(15,7))
sns.scatterplot(data=data1, x="Total_Trans_Amt", y="Total_Trans_Ct", hue="Attrition_Flag", palette="hot")
plt.legend(bbox_to_anchor=(1.00, 1))
plt.show()
plt.figure(figsize=(15,7))
sns.scatterplot(data=data1, x="Total_Revolving_Bal", y="Total_Ct_Chng_Q4_Q1", hue="Attrition_Flag", palette="hot")
plt.legend(bbox_to_anchor=(1.00, 1))
plt.show()
Total transaction amount is higher for existing customers than for churned customers. The bank should create outreach programs targeting customers who spend less than 10,000 dollars on their credit card.
Total transaction count is higher for existing customers than for churned customers. The bank should provide promotional offers to increase engagement for customers with fewer than 80 transactions in the last 12 months.
Customers with a total revolving balance of less than 500 tend to leave the bank, so these customers should be targeted in retention campaigns.